Abstract

We consider classification problems in which the label space has structure.
A common example is hierarchical label spaces, corresponding to the case
where one label subsumes another (e.g., animal subsumes dog). But labels can
also be mutually exclusive (e.g., dog vs cat) or unrelated (e.g., furry,
carnivore). To jointly model hierarchy and exclusion relations, the notion
of a HEX (hierarchy and exclusion) graph was introduced in [8]. This combined
a conditional random field (CRF) with a deep neural network (DNN), resulting
in state-of-the-art results when applied to visual object classification
problems where the training labels were drawn from different levels of the
ImageNet hierarchy (e.g., an image might be labeled with the basic level
category "dog", rather than the more specific label "husky"). In this paper,
we extend the HEX model to allow for soft or probabilistic relations between
labels, which is useful when there is uncertainty about the relationship
between two labels (e.g., an antelope is "sort of" furry, but not to the
same degree as a grizzly bear). We call our new model pHEX, for probabilistic
HEX. We show that the pHEX graph can be converted to an Ising model, which
allows us to use existing off-the-shelf inference methods (in contrast to the
HEX method, which needed specialized inference algorithms). Experimental
results show significant improvements in a number of large-scale visual
object classification tasks, outperforming the previous HEX model.


1. Introduction 

Classification is a fundamental problem in machine learning and computer
vision. In this paper, we consider how to extend the standard approach to
exploit structure in the label space. For example, consider the problem of
classifying images of animals. The labels may be names of animal types
(e.g., dog, puppy, cat), or attribute labels (e.g., yellow, furry,
has-stripes). Many of these labels are not semantically independent of each
other. For example, a puppy is also a dog, which is a hierarchical or
subsumption relation; an animal cannot be both a dog and a cat, an exclusive
relation; but an animal can be yellow and furry, which is a non-relation.

In [8], an approach called Hierarchy and Exclusion (HEX) graphs was proposed
for compactly representing such constraints between the labels. In
particular, a probabilistic graphical model with deterministic or hard
constraints between the binary label nodes was proposed. These hard
constraints cut down the feasible set of labels from $2^n$ (where $n$ is the
number of labels) to something much smaller, allowing for efficient exact
inference. For example, if all labels are mutually exclusive, the HEX graph
is a clique, and there are only $n + 1$ valid label configurations. This
graphical model can be combined with any standard discriminative classifier
(such as deep neural networks), resulting in a conditional random field (CRF)
model with label constraints.

In this paper, we extend the HEX model by allowing for "soft" relationships
between the labels. We call this the pHEX model. The pHEX model has five
main advantages compared to the HEX model. First, it is a more realistic
model, since the relationship between most labels is "soft". For example, a
lion may be mostly yellow, but it could also be another color. Second, the
pHEX model is easier to train, since the likelihood function is smoother.
Third, we show how to perform inference in the pHEX model by converting it
to an Ising model, and then using standard off-the-shelf tools such as belief
propagation, or the emerging quantum optimization technology [11]. This is in
contrast to the HEX case, which needed a specialized (and rather complex)
algorithm to perform inference. Fourth, we show how to combine binary labels
with k-ary labels, something that wasn't possible with the original HEX
model. Finally, we show that the pHEX model outperforms the HEX model on a
variety of visual object classification tasks.

2. Related work

There has been a lot of prior work on exploiting structure in the label
space; we only have space to mention a few key papers here. Conditional
random fields [12, 24] and structural SVMs [23] are often used in structured
prediction problems. In addition, in transfer learning [20, 19], zero-shot
learning [13, 17], and attribute-based recognition [ , 26, 21], consistency
between visual predictions and semantic relations is often enforced.

More closely related to this paper is work that exploits hierarchical
structure (e.g., [27, 14, 25, 16]), exclusive relations [5], or both of them
[6, 15]. Recently [ ] proposed the HEX graph approach, which subsumes a lot
of prior work by modeling hierarchical and exclusive relations using
graphical models. We discuss this in more detail in Section 3, since it forms
the foundation for the current paper.

3. The HEX model 

In a nutshell, HEX graphs are probabilistic graphical models with directed
and undirected edges over a number of binary variables. Each binary variable
represents a label and takes a value from $\{-1, 1\}$. Each edge or no-edge
between any two labels represents one of three label relations: exclusion,
hierarchy and non-relation. The combination of all pairwise label relations
allows the HEX graph to characterize the legal and illegal state space of
labels, as we explain below.

3.1. HEX relations

The three types of label relations in the HEX graph are defined as follows:

Exclusion When two nodes are connected by an undirected edge, this is called
an exclusive relation. It means that the two labels cannot both be equal to
1. For example, an animal cannot be both a cat and a dog. So cat and dog are
mutually exclusive. The legal state space for exclusion is:

$$S_e = \{(-1,-1),\ (-1,1),\ (1,-1)\}. \tag{1}$$

Hierarchy When two nodes are connected by a directed edge from $y_1$ to
$y_2$, this is called a subsumption (hierarchical) relation. It means that if
$y_2$ is 1 then $y_1$ must be 1 as well. For example, a puppy is always a
dog. So dog subsumes puppy. The corresponding legal state space for
subsumption is:

$$S_h = \{(-1,-1),\ (1,-1),\ (1,1)\}. \tag{2}$$

No relation When two nodes are not connected by any edge, we say there is no
relation between them. This means that the two labels are independent of each
other. For example, carnivore and yellow are independent properties of
animals. In this case, the legal state space for the two variables contains
all 4 possible configurations:

$$S_o = \{(-1,-1),\ (-1,1),\ (1,-1),\ (1,1)\}. \tag{3}$$

3.2. HEX graph as a graphical model

To mathematically formulate the HEX model, assume we have a set of $n$
possible labels, represented as the bit vector
$\mathbf{y} = \{y_1, \ldots, y_n\}$, where $y_i \in \{-1, +1\}$. Also, assume
we have an input feature vector $\mathbf{x} = \{x_1, \ldots, x_d\}$, and some
discriminative model which maps this to the score vector
$\mathbf{z} = \{z_1, \ldots, z_n\}$, where $z_i$ is the "local evidence" for
label $y_i$. (The mapping from $\mathbf{x}$ to $\mathbf{z}$ is arbitrary; in
this paper, we assume it is represented by a deep neural network
parameterized by $\mathbf{w}$, which we will denote by
$\mathbf{z} = \mathrm{DNN}(\mathbf{x}; \mathbf{w})$.) Given this, we can
define the model as follows:

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{i=1}^{n} \psi(y_i, z_i) \prod_{(i,j) \in G} \phi(y_i, y_j), \tag{4}$$

where $\psi(y_i, z_i) = 1/(1 + \exp(-2 y_i z_i))$ is the logistic function,
and $\phi(y_i, y_j)$ is the (edge-specific) potential function, defined
below. (We use the notation $\phi_a$ to represent an "absolute" or
deterministic potential, to distinguish it from the soft or probabilistic
potentials we use later, denoted by $\phi_p$.)

• Exclusion

$$\phi^e_a(y_1, y_2) = \begin{cases} 1 & (y_1, y_2) \in S_e \\ 0 & (y_1, y_2) = (1, 1); \end{cases} \tag{5}$$

• Hierarchy

$$\phi^h_a(y_1, y_2) = \begin{cases} 1 & (y_1, y_2) \in S_h \\ 0 & (y_1, y_2) = (-1, 1); \end{cases} \tag{6}$$

• No relation

$$\phi^o_a(y_1, y_2) = 1 \quad \forall (y_1, y_2). \tag{7}$$
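To make the effect of the hard constraints concrete, the following minimal sketch enumerates the legal states of a toy HEX graph and computes exact marginals from Equation (4) by brute force. The function names, the three-label toy graph and the scores are illustrative assumptions, not taken from [8], and enumeration is only feasible for a handful of labels.

```python
import itertools
import math

def is_legal(y, exclusions, hierarchies):
    """True if the +/-1 assignment y violates no hard HEX constraint."""
    for i, j in exclusions:              # i and j cannot both be on
        if y[i] == 1 and y[j] == 1:
            return False
    for parent, child in hierarchies:    # child on implies parent on
        if y[child] == 1 and y[parent] == -1:
            return False
    return True

def hex_marginals(z, exclusions, hierarchies):
    """Exact p(y_i = 1 | z) from Equation (4), summing over legal states only."""
    n = len(z)
    states = [y for y in itertools.product((-1, 1), repeat=n)
              if is_legal(y, exclusions, hierarchies)]
    # Unnormalised weight of a state: product of logistic unaries psi(y_i, z_i).
    weights = [math.prod(1.0 / (1.0 + math.exp(-2.0 * yi * zi))
                         for yi, zi in zip(y, z))
               for y in states]
    total = sum(weights)
    marginals = [sum(w for y, w in zip(states, weights) if y[i] == 1) / total
                 for i in range(n)]
    return states, marginals

# Toy graph (hypothetical): 0 = dog, 1 = puppy, 2 = cat.
states, marginals = hex_marginals(
    z=[0.2, 1.0, -0.5],
    exclusions=[(0, 2), (1, 2)],
    hierarchies=[(0, 1)],               # dog subsumes puppy
)

# A clique of n mutually exclusive labels leaves n + 1 legal states.
n = 4
clique = [(i, j) for i in range(n) for j in range(i + 1, n)]
clique_states, _ = hex_marginals([0.0] * n, clique, [])
```

In the toy graph only four of the eight assignments are legal, and the marginal of "dog" can never fall below that of "puppy", because every legal state with puppy on also has dog on.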

4. Probabilistic HEX models 

In this section, we introduce an extension of the HEX model to allow for
soft or probabilistic relationships between labels. The basic idea is to
relax the hard constraints, by replacing the value 0 (corresponding to
illegal combinations) in the definitions of the potential functions with a
value $0 < q < 1$, representing how strongly we wish to enforce the
constraints. (This is somewhat analogous to the approach used in Markov logic
networks [18], which relax the hard constraints used in first order logic.)
In principle, $q$ can be estimated from data along with the parameters of the
unary potentials (discussed in Section 6), but in this paper, we either tie
the $q$'s across all edges, or set them based on prior knowledge of the
strength of the relations.

4.1. Probabilistic HEX relations

For clarity, we now explicitly specify the form of the two new factors we
introduce. We use the generic parameter $q$ to represent the strength of this
relation, although this could easily be made edge/label dependent.

Probabilistic exclusion The potential function of the two variables $y_1$,
$y_2$ under probabilistic exclusion is defined as:

$$\phi^e_p(y_1, y_2; q) = \begin{cases} 1 & (y_1, y_2) \in S_e \\ q & (y_1, y_2) = (1, 1), \end{cases} \tag{8}$$

where $0 \le q \le 1$. When $q = 1$, Equation (8) reduces to the non-relation
in Equation (7), where $y_1$ and $y_2$ are independent. When $q = 0$,
Equation (8) reduces to the hard exclusion relation in Equation (5), where
$(y_1, y_2) = (1, 1)$ is strictly prohibited.

Probabilistic hierarchy For hierarchy (subsumption), we define

$$\phi^h_p(y_1, y_2; q) = \begin{cases} 1 & (y_1, y_2) \in S_h \\ q & (y_1, y_2) = (-1, 1), \end{cases} \tag{9}$$

where $0 \le q \le 1$. This reduces to the unconstrained relation when
$q = 1$, and to the hard subsumption relation when $q = 0$.

Probabilistic exclusions and subsumptions can be seen as a probabilistic
mixture of absolute exclusions, subsumptions, and non-relations, where

$$\phi^e_p(y_1, y_2; q) = q\,\phi^o_a(y_1, y_2) + (1 - q)\,\phi^e_a(y_1, y_2),$$
$$\phi^h_p(y_1, y_2; q) = q\,\phi^o_a(y_1, y_2) + (1 - q)\,\phi^h_a(y_1, y_2).$$

Therefore, the combination of probabilistic label relations generalizes the
absolute label relations in the HEX graph.

4.2. Converting pHEX models to Ising models

The main disadvantage of this relaxation is that we lose the ability to
perform tractable exact inference. However, we now show that we can formulate
pHEX models as Ising models, which opens up the door to using standard
tractable approximate inference methods.

Figure 1. (a) Probabilistic exclusive relation in a pHEX graph with
$\phi(1, 1) = q$; (b) the coefficients on the nodes and the edge of the
equivalent Ising model, where $q = \exp(-4u)$.

Figure 2. (a) Probabilistic subsumption relation in a pHEX graph with
$\phi(-1, 1) = q$; (b) the coefficients of the equivalent Ising model, where
$q = \exp(-4u)$.

The Ising model was first proposed in statistical mechanics to study
ferromagnetism [3]. Mathematically, it is essentially an undirected graphical
model which defines the joint distribution of configurations of $n$ binary
random variables $\mathbf{y}$ in graph $G$ by a Boltzmann distribution,

$$P_\beta(\mathbf{y}) = \frac{1}{Z_\beta} \exp(-\beta E(\mathbf{y})), \tag{10}$$

where $Z_\beta$ is the normalization constant, and $\beta$ is a temperature
variable that will be omitted later by fixing it to 1. $E(\mathbf{y})$ is the
energy function of the configuration $\mathbf{y}$, which takes into account
local energy potentials $h_i y_i$ as well as pairwise energy potentials
$J_{ij} y_i y_j$,

$$E(\mathbf{y}) = \sum_{(i,j) \in G} J_{ij} y_i y_j + \sum_{i=1}^{n} h_i y_i. \tag{11}$$
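The soft potentials of Equations (8) and (9), and their reading as a mixture of a hard relation and a non-relation, can be checked numerically. The sketch below is illustrative only; the helper names are assumptions introduced here.

```python
import itertools
import math

S_E = {(-1, -1), (-1, 1), (1, -1)}   # legal states under exclusion, Eq. (1)
S_H = {(-1, -1), (1, -1), (1, 1)}    # legal states under hierarchy, Eq. (2)

def phi_hard(y1, y2, legal_states):
    """Absolute potential: 1 on legal states, 0 on the illegal one."""
    return 1.0 if (y1, y2) in legal_states else 0.0

def phi_soft(y1, y2, legal_states, q):
    """Probabilistic potential: 1 on legal states, q on the illegal one."""
    return 1.0 if (y1, y2) in legal_states else q

def phi_mixture(y1, y2, legal_states, q):
    """q * (no-relation potential, always 1) + (1 - q) * hard potential."""
    return q * 1.0 + (1.0 - q) * phi_hard(y1, y2, legal_states)

# Largest gap between the direct definition and the mixture form.
max_gap = max(
    abs(phi_soft(y1, y2, S, q) - phi_mixture(y1, y2, S, q))
    for S in (S_E, S_H)
    for q in (0.0, 0.3, 1.0)
    for y1, y2 in itertools.product((-1, 1), repeat=2)
)
```

Setting `q = 0.0` recovers the hard potentials and `q = 1.0` recovers the non-relation, matching the two limiting cases described in the text.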


To convert a pHEX graph to an Ising model, we first show how to convert the
factor functions $\phi_p(y_1, y_2)$ for the pairwise probabilistic relations
to the equivalent pairwise energy functions $E(y_1, y_2)$ of an Ising model.

Consider an Ising model of two variables in Figure 1(b), where $u \ge 0$ is
the weight on the local potentials and the pairwise potential. The resulting
pairwise energy function of this Ising model is,

$$E^e_p(y_1, y_2; u) = u y_1 y_2 + u y_1 + u y_2 = \begin{cases} -u & (y_1, y_2) \in S_e \\ 3u & (y_1, y_2) = (1, 1). \end{cases} \tag{12}$$

Clearly, Equation (12) looks very similar to Equation (8). In fact, by
letting $q = \exp(-4u)$ and
$\phi^e_p(y_1, y_2; q) \propto \exp(-E^e_p(y_1, y_2; u))$, we can show they
are equivalent up to a constant factor. To see this, let $(y_1, y_2)$ be a
legal label pair, and $(y'_1, y'_2)$ be an illegal pair. We have

$$\phi^e_p(y_1, y_2) / \phi^e_p(y'_1, y'_2) = 1/q = e^{4u},$$
$$\exp(-E(y_1, y_2) + E(y'_1, y'_2)) = e^{u + 3u} = e^{4u}.$$

A larger $u$ means a stronger exclusion between the two labels. When
$u \to +\infty$, Equation (12) reduces to the hard exclusive relation;
conversely when $u = 0$, Equation (12) reduces to the non-relation.

Similarly, the equivalent Ising model of the probabilistic subsumption is
shown in Figure 2(b), where,

$$E^h_p(y_1, y_2; u) = -u y_1 y_2 - u y_1 + u y_2 = \begin{cases} -u & (y_1, y_2) \in S_h \\ 3u & (y_1, y_2) = (-1, 1). \end{cases}$$

We set $\phi^h_p(y_1, y_2; q) \propto \exp(-E^h_p(y_1, y_2; u))$ and
$q = \exp(-4u)$.

The product of the pairwise factor functions $\phi_p(y_i, y_j; q_{ij})$ can
now be written in terms of the sum of pairwise energy functions
$E(y_i, y_j; u_{ij})$:

$$\prod_{(i,j) \in G} \phi_p(y_i, y_j; q_{ij}) \propto \exp\Big(-\sum_{(i,j) \in G} J_{ij} y_i y_j - \sum_{i=1}^{n} h_i y_i\Big),$$

where

$$J_{ij} = \begin{cases} u_{ij} & (i,j) \in \text{ex.} \\ -u_{ij} & (i,j) \vee (j,i) \in \text{sub.} \end{cases} \tag{14}$$

$$h_i = \sum_{\{j \mid (i,j) \vee (j,i) \in \text{ex.}\}} u_{ij} + \sum_{\{k \mid (k,i) \in \text{sub.}\}} u_{ki} - \sum_{\{l \mid (i,l) \in \text{sub.}\}} u_{il}. \tag{15}$$

Here ex. denotes the set containing all exclusive relations and sub. the set
containing all subsumption relations. Note that all the pairs
$(i, j) \in \text{ex.}$ satisfy $i < j$, and $(i, j) \in \text{sub.}$ means
$i$ subsumes $j$. The signs in Equation (15) follow from the two pairwise
energies above: a node receives $+u$ from each exclusive neighbour and from
each node that subsumes it, and $-u$ from each node that it subsumes.

To incorporate local evidence into the model, we can rewrite Equation (4) as
follows:

$$p(\mathbf{y} \mid \mathbf{z}) \propto \exp\Big(\sum_{i=1}^{n} \log \psi(y_i, z_i) - \sum_{(i,j) \in G} E(y_i, y_j; u_{ij})\Big) = \exp\Big(-\sum_{(i,j) \in G} J_{ij} y_i y_j - \sum_{i=1}^{n} (h_i - z_i) y_i\Big),$$

where $J_{ij}$ and $h_i$ are from Equation (14) and Equation (15). Note that
we omitted a constant from the $\log \psi$ term, because it will be canceled
out by the normalization constant $Z$. By defining $h'_i = h_i - z_i$, we can
"absorb" the local evidence into the Ising model, and use standard inference
methods.
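The conversion can be sanity-checked on a small graph. The sketch below accumulates the couplings and local fields edge by edge, using the two pairwise energies given for the exclusive and subsumption cases, and confirms by enumeration that the product of soft potentials and exp(-E) differ only by a constant factor. The function names, the dictionary-based edge format and the toy edge weights are assumptions made for illustration.

```python
import itertools
import math

def ising_coefficients(n, ex, sub):
    """ex: {(i, j): u} exclusive pairs; sub: {(parent, child): u} subsumptions."""
    J = {}
    h = [0.0] * n
    for (i, j), u in ex.items():       # E = u*yi*yj + u*yi + u*yj
        J[(i, j)] = u
        h[i] += u
        h[j] += u
    for (p, c), u in sub.items():      # E = -u*yp*yc - u*yp + u*yc
        J[(p, c)] = -u
        h[p] -= u
        h[c] += u
    return J, h

def energy(y, J, h):
    """Ising energy of Equation (11)."""
    return (sum(Jij * y[i] * y[j] for (i, j), Jij in J.items())
            + sum(hi * yi for hi, yi in zip(h, y)))

def potential_product(y, ex, sub):
    """Product of soft potentials with q = exp(-4u) on each illegal pair."""
    prod = 1.0
    for (i, j), u in ex.items():
        if y[i] == 1 and y[j] == 1:
            prod *= math.exp(-4.0 * u)
    for (p, c), u in sub.items():
        if y[p] == -1 and y[c] == 1:
            prod *= math.exp(-4.0 * u)
    return prod

# Toy graph (hypothetical): 0 = dog, 1 = puppy, 2 = cat.
ex = {(0, 2): 0.5, (1, 2): 0.5}
sub = {(0, 1): 1.0}
J, h = ising_coefficients(3, ex, sub)
ratios = [potential_product(y, ex, sub) / math.exp(-energy(y, J, h))
          for y in itertools.product((-1, 1), repeat=3)]

# Absorbing local evidence z into the fields: h'_i = h_i - z_i.
z = [0.2, 1.0, -0.5]
h_prime = [hi - zi for hi, zi in zip(h, z)]
```

Every entry of `ratios` equals exp(-(0.5 + 0.5 + 1.0)), the constant absorbed by the normalizer: each edge contributes a factor exp(-u) regardless of the state.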

4.3. Inference in pHEX models 

At test time, we need to compute the marginal distribution per label,
$p(y_i \mid \mathbf{z})$. In multi-label classification problems, a label
$y_i$ is predicted to be true if $p(y_i = 1 \mid \mathbf{z}) > 0.5$. In
multi-class classification problems, the label

$$y^* = \arg\max_{i}\ p(y_i = 1 \mid \mathbf{z})$$

is predicted to be the true label. At training time, we need
$p(y_i \mid \mathbf{z})$ as well as the term
$p(y_i \mid y_j = 1, \mathbf{z})$, where some of the true observed labels
(e.g. for node $j$) are set to their desired target states.

Exact inference in pHEX models is usually intractable when the graphs are
loopy and the legal states are not sparse. Since $p(\mathbf{y} \mid
\mathbf{z})$ is an Ising model, we can apply any off-the-shelf inference
method, including mean-field inference (MF), loopy belief propagation (LBP),
and Markov Chain Monte Carlo (MCMC) methods [ ]. In practice, we find that
the standard LBP algorithm works consistently well, so we use it as our main
inference algorithm in our experiments. We give the details below.

We define the belief on each label $y_i$ to be $b_i(-1)$ and $b_i(1)$, and
the message from $y_i$ to its neighbour $y_j$ to be $m_{i \to j}(-1)$ and
$m_{i \to j}(1)$. Then the algorithm iterates through all beliefs and
messages with updates,

$$b_i(1) \propto \exp(-h'_i) \prod_{j \in N(i)} m_{j \to i}(1),$$
$$b_i(-1) \propto \exp(h'_i) \prod_{j \in N(i)} m_{j \to i}(-1),$$

where $N(i)$ denotes the neighbours of $i$, and

$$m_{i \to j}(1) \propto \exp(-J_{ij}) \frac{b_i(1)}{m_{j \to i}(1)} + \exp(J_{ij}) \frac{b_i(-1)}{m_{j \to i}(-1)},$$
$$m_{i \to j}(-1) \propto \exp(-J_{ij}) \frac{b_i(-1)}{m_{j \to i}(-1)} + \exp(J_{ij}) \frac{b_i(1)}{m_{j \to i}(1)}.$$

To maintain numerical stability, we normalize $b_i$ and $m_{j \to i}$
throughout inference, and we perform updates in the log domain. After all
beliefs have converged or a maximum number of iterations has been reached, we
estimate the marginal probabilities by
$p(y_i = 1 \mid \mathbf{z}) = b_i(1)$.

The inference of $p(y_i \mid y_j = 1, \mathbf{z})$ is almost the same as
above except we set $b_j(1) = 1$ and $b_j(-1) = 0$ to represent the fact that
node $j$ is clamped to state 1. (We can easily extend this procedure if we
have multiple clamped nodes.)
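As a rough illustration of sum-product LBP on the Ising form $p(\mathbf{y}) \propto \exp(-\sum J_{ij} y_i y_j - \sum h'_i y_i)$, the sketch below runs synchronous message updates and compares the resulting beliefs with brute-force marginals. It is a simplified stand-in rather than the authors' implementation: it works in the probability domain instead of the log domain, and it multiplies the incoming messages except the one from the recipient, which is equivalent to dividing the belief by that message. The function names and toy coefficients are assumptions.

```python
import itertools
import math

def brute_force_marginals(J, hp, n):
    """Exact p(y_i = 1) by enumerating all 2^n states."""
    num, Z = [0.0] * n, 0.0
    for y in itertools.product((-1, 1), repeat=n):
        e = sum(Jij * y[i] * y[j] for (i, j), Jij in J.items())
        e += sum(hp[i] * y[i] for i in range(n))
        w = math.exp(-e)
        Z += w
        for i in range(n):
            if y[i] == 1:
                num[i] += w
    return [v / Z for v in num]

def lbp_marginals(J, hp, n, iters=50):
    """Sum-product LBP; J maps undirected pairs (i, j) to couplings J_ij."""
    nbrs = {i: [] for i in range(n)}
    Jsym = {}
    for (i, j), Jij in J.items():
        nbrs[i].append(j)
        nbrs[j].append(i)
        Jsym[(i, j)] = Jsym[(j, i)] = Jij
    # msg[(i, j)][s] is the message from node i to node j about y_j = s.
    msg = {(i, j): {1: 0.5, -1: 0.5} for i in nbrs for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            out = {}
            for yj in (1, -1):
                total = 0.0
                for yi in (1, -1):
                    w = math.exp(-hp[i] * yi - Jsym[(i, j)] * yi * yj)
                    for k in nbrs[i]:
                        if k != j:           # exclude the recipient's message
                            w *= msg[(k, i)][yi]
                    total += w
                out[yj] = total
            s = out[1] + out[-1]             # normalise for stability
            new[(i, j)] = {1: out[1] / s, -1: out[-1] / s}
        msg = new
    beliefs = []
    for i in range(n):
        b = {s: math.exp(-hp[i] * s) * math.prod(msg[(k, i)][s] for k in nbrs[i])
             for s in (1, -1)}
        beliefs.append(b[1] / (b[1] + b[-1]))
    return beliefs

# Chain 0 - 1 - 2 (a tree, so LBP is exact here); coefficients are arbitrary.
J = {(0, 1): -1.0, (1, 2): 0.5}
hp = [0.3, -0.2, 0.1]
exact = brute_force_marginals(J, hp, 3)
approx = lbp_marginals(J, hp, 3)
```

On a tree-structured graph such as this chain the beliefs match the exact marginals; on loopy graphs they are only approximations, which is the regime the text is concerned with.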


5. Mutually exclusive and collectively exhaustive relations

In addition to allowing soft relations, our pHEX framework offers another
advantage over HEX graphs: it is easy to enforce a new type of constraint,
namely Mutually Exclusive and Collectively Exhaustive (MECE) relations, used
in the multi-class softmax model. In HEX graphs, there is no way to express
the notion of "collectively exhaustive", i.e., one of the mutually exclusive
classes must be true. A HEX graph thus has to maintain an additional "none of
the above" state.

In the pHEX graph, we handle the MECE relation of $k$ nodes using a single
multinomial variable with $k$ possible states. Although an undirected
graphical model with multinomial nodes is strictly speaking not an Ising
model, a slight variant on the standard LBP algorithm can still be applied
for efficient approximate inference.

For simplicity, we only illustrate the inference algorithm for pHEX graphs
with one multinomial label node, since this will be used in later
experiments. Further generalization to pHEX graphs with multiple multinomial
nodes is straightforward and follows similar procedures.

Let us denote the multinomial node by $c = \{c_1, \ldots, c_k\}$. The node
and message updates for the standard binary nodes are the same as before. The
belief of the multinomial node $c$ is updated as,

$$b_c(i) \propto \exp(-h'_{c_i}) \prod_{j \in N(c)} m_{j \to c}(i)$$

for state $i \in \{1, \ldots, k\}$ in which $y_{c_i} = 1$. Here
$N(c) = \cup_{i=1}^{k} N(c_i)$ is the neighbour set of the multinomial node.
The message from a standard node $j$ to the multinomial node $c$ is,

$$m_{j \to c}(i) \propto \exp\Big(\sum_{s=1}^{k} J_{j c_s} - 2 J_{j c_i}\Big) \frac{b_j(1)}{m_{c \to j}(1)} + \exp\Big(-\sum_{s=1}^{k} J_{j c_s} + 2 J_{j c_i}\Big) \frac{b_j(-1)}{m_{c \to j}(-1)}$$

for state $i$. The message from the multinomial node to a standard node $j$
is,

$$m_{c \to j}(1) \propto \sum_{i=1}^{k} \exp\Big(\sum_{s=1}^{k} J_{j c_s} - 2 J_{j c_i}\Big) \frac{b_c(i)}{m_{j \to c}(i)},$$
$$m_{c \to j}(-1) \propto \sum_{i=1}^{k} \exp\Big(-\sum_{s=1}^{k} J_{j c_s} + 2 J_{j c_i}\Big) \frac{b_c(i)}{m_{j \to c}(i)}.$$

As in the standard LBP algorithm, we normalize $b_c$, $m_{j \to c}$ and
$m_{c \to j}$ and update them in the log domain. After the algorithm
converges, the marginal probability of a node $c_k$ in clique $c$ is
$p(y_{c_k} = 1 \mid \mathbf{z}) = b_c(k)$.
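The effect of a MECE relation can be illustrated without message passing by restricting a brute-force sum to states in which exactly one node of the group is on. This is only a reference computation for small graphs, not the LBP variant described above; the function name, graph and coefficients are assumptions.

```python
import itertools
import math

def mece_marginals(J, hp, n, group):
    """Exact marginals when exactly one node in `group` is on (MECE)."""
    num, Z = [0.0] * n, 0.0
    for y in itertools.product((-1, 1), repeat=n):
        if sum(1 for c in group if y[c] == 1) != 1:
            continue                   # state violates the MECE constraint
        e = sum(Jij * y[i] * y[j] for (i, j), Jij in J.items())
        e += sum(hp[i] * y[i] for i in range(n))
        w = math.exp(-e)
        Z += w
        for i in range(n):
            if y[i] == 1:
                num[i] += w
    return [v / Z for v in num]

# Nodes 0..2 form the multinomial node c; node 3 is a binary attribute that
# softly subsumes class 0 and softly excludes class 1 (arbitrary weights).
J = {(3, 0): -0.5, (3, 1): 0.5}
hp = [0.3, 0.6, -0.3, 0.4]
marg = mece_marginals(J, hp, 4, group=[0, 1, 2])
class_mass = sum(marg[c] for c in (0, 1, 2))
```

Because every admissible state switches on exactly one class, the class marginals sum to one, which is the behaviour of a softmax output and removes the need for a "none of the above" state.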


6. Learning 

An important property of the (p)HEX model is that 
not all the target labels need to be specified during 
training. For example, consider a data set of images. 
It is more common for a user to use basic level category 
names, such as “dog”, than very specific names such 
as “husky” or “beagle”. Furthermore, a user may not 
label everything in an image. So the absence of a label 
is not evidence of its absence. 

To model this, we allow some of the labels to be 
unobserved or hidden during training. For example, 
if we clamp the “husky” label to true, and leave all 
other label nodes unclamped, the hard constraints will 
force the “dog” label to turn on, indicating that this 
instance is an example of both the husky class and 
the dog class. However, if we clamp the “dog” label 
to true, we will not turn on “husky” or “beagle”, since 
the relation is asymmetric. We can also clamp labels to 
the off state, if we know that the corresponding class 
is definitely absent. For example, turning on “dog” 
will turn off “cat” if they are mutually exclusive. (In 
the pHEX case, the “illegal” states are down weighted, 
rather than given zero probability.) 

Let the input scores for the $b$'th training instance be $\mathbf{z}^b$, and
let the subset of target labels be
$\mathbf{t}^b = (t^b_1, \ldots, t^b_m)$, where we have assumed that $m$
labels are observed in every instance for notational simplicity. A natural
loss function is the negative log likelihood of the observed labels given the
inputs:

$$\mathcal{L} = -\sum_{b=1}^{N} \sum_{j=1}^{m} \log p(y_{t^b_j} = 1 \mid \mathbf{z}^b).$$

To fit the local classifiers (unary potentials), we first need to derive the
gradient of the loss with respect to the input scores $z_i$. The derivative
of $\log p(y_{t_j} = 1 \mid \mathbf{z})$ with respect to some $z_i$ is,

$$\frac{\partial \log p(y_{t_j} = 1 \mid \mathbf{z})}{\partial z_i} = E_{p(y_i \mid y_{t_j} = 1, \mathbf{z})}[y_i] - E_{p(y_i \mid \mathbf{z})}[y_i].$$

Therefore, we need to compute the conditional distributions
$p(y_i \mid y_{t_j} = 1, \mathbf{z})$ and marginal distributions
$p(y_i \mid \mathbf{z})$ for all $i$. These correspond to the well-known
"clamped" and "unclamped" phases of MRF / CRF learning. We can then
backpropagate the gradient into the parameters of the local classifiers
themselves.
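The clamped-minus-unclamped form of the gradient can be verified on a small model by comparing it with central finite differences of the log likelihood. The sketch below uses brute-force enumeration in place of LBP, so the expectations are exact; the function names and toy coefficients are assumptions.

```python
import itertools
import math

def log_z_and_means(J, h, z, clamp=None):
    """Brute-force log-partition and E[y_i]; `clamp` fixes one node to +1."""
    n = len(z)
    total, means = 0.0, [0.0] * n
    for y in itertools.product((-1, 1), repeat=n):
        if clamp is not None and y[clamp] != 1:
            continue
        e = sum(Jij * y[i] * y[j] for (i, j), Jij in J.items())
        e += sum((h[i] - z[i]) * y[i] for i in range(n))   # h'_i = h_i - z_i
        w = math.exp(-e)
        total += w
        for i in range(n):
            means[i] += w * y[i]
    return math.log(total), [m / total for m in means]

def log_likelihood(J, h, z, t):
    """log p(y_t = 1 | z) as a difference of clamped and free log-partitions."""
    return log_z_and_means(J, h, z, clamp=t)[0] - log_z_and_means(J, h, z)[0]

def gradient(J, h, z, t):
    """d log p(y_t = 1 | z) / dz_i = E_clamped[y_i] - E_free[y_i]."""
    _, clamped = log_z_and_means(J, h, z, clamp=t)
    _, free = log_z_and_means(J, h, z)
    return [c - f for c, f in zip(clamped, free)]

# Toy Ising coefficients (arbitrary) and local evidence.
J = {(0, 1): -1.0, (0, 2): 0.5, (1, 2): 0.5}
h = [-0.5, 1.5, 1.0]
z = [0.2, 1.0, -0.5]
analytic = gradient(J, h, z, t=1)

eps = 1e-6
numeric = []
for i in range(len(z)):
    zp = list(z)
    zp[i] += eps
    zm = list(z)
    zm[i] -= eps
    numeric.append(
        (log_likelihood(J, h, zp, 1) - log_likelihood(J, h, zm, 1)) / (2 * eps))
```

In practice the two expectations would come from two LBP runs, one with the target node clamped and one without, so the gradient inherits the approximation error of LBP.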

We can use a similar gradient-based training scheme to estimate the CRF edge
parameters. However, in this paper, we simply combine prior edge weights
derived from data with a one-dimensional grid search over a rescaling factor.


7. Experiments 

In [8], HEX graphs show significant improvement over standard softmax and
(multi-label) logistic regression models, so in this paper, we will just
compare pHEX to HEX. We conduct three experiments.

The first experiment is the standard ImageNet image classification problem
[7]. We add hierarchical relations between the labels based on the publicly
available WordNet hierarchy. Since WordNet does not have exclusive relations,
we assume that any two labels are exclusive if they are not in a subsumption
relation. Figure 3 (left) is an example of the subgraph of "fish". As in [8],
we assume that the training labels are drawn from different levels of the
hierarchy. In this paper, we show that pHEX with (constant) soft relations
improves on HEX, especially when leaf labels are rarely present in the
training set.

Figure 3. Left: An illustration of the (p)HEX graph based on the WordNet
hierarchy in the ImageNet experiments. Right: An illustration of the (p)HEX
graph in the Animals with Attributes experiments. The blue directed edges
denote the subsumption relations; and the red undirected edges denote the
exclusive relations. An MECE relation (multinomial node) is placed in the
final pHEX graph.
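The rule used to build the ImageNet graph (subsumption edges from the hierarchy, exclusion between any two labels that are not in a subsumption relation) can be sketched as follows. The sketch assumes a tree in which each label has at most one parent and treats "in a subsumption relation" as ancestor or descendant; both are simplifying assumptions, and the label names are made up for illustration.

```python
def ancestors(node, parent):
    """All labels that subsume `node` under a single-parent hierarchy."""
    out = set()
    while node in parent:
        node = parent[node]
        out.add(node)
    return out

def build_hex_edges(parent):
    """Return (subsumption edges, exclusion edges) from a child -> parent map."""
    nodes = sorted(set(parent) | set(parent.values()))
    anc = {v: ancestors(v, parent) for v in nodes}
    sub = [(p, c) for c, p in parent.items()]       # p subsumes c
    ex = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]
          if a not in anc[b] and b not in anc[a]]   # unrelated => exclusive
    return sub, ex

# Hypothetical fragment of a "fish" subgraph.
parent = {"shark": "fish", "tench": "fish", "great_white": "shark"}
sub_edges, ex_edges = build_hex_edges(parent)
```

Here "great_white" and "tench" become exclusive because neither is an ancestor of the other, while "fish" is exclusive with nothing because it subsumes every other label in the fragment.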


The second experiment is a zero-shot learning task, in which we must predict
unseen classes at test time, leveraging known relations between the class
labels and attributes of the class. We use the Animals with Attributes
dataset [13]. Following [8], we first assume that all object classes are
mutually exclusive. We then add subsumption relations from a predicate (or
attribute) to an object if the binary predicate of the object is 1, and add
exclusive relations between predicate and objects if the binary predicate of
the object is 0. See the illustration in Figure 3 (right). In this paper, we
relax the hard constraints and show that pHEX can work significantly better
than HEX. Finally, the third experiment is another zero-shot learning task,
this time on the PASCAL VOC/Yahoo images with attributes dataset [10]. Again,
we show that pHEX can significantly outperform HEX.
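The graph construction for the Animals with Attributes experiment described above can be sketched as a direct translation of the binary predicate matrix into edges. The nested-dictionary input format, the function name and the toy classes and attributes are assumptions for illustration.

```python
def build_attribute_relations(predicates):
    """predicates[cls][attr] in {0, 1}: binary predicate of attr for cls."""
    classes = list(predicates)
    sub, ex = [], []
    for cls, row in predicates.items():
        for attr, value in row.items():
            if value == 1:
                sub.append((attr, cls))   # attribute subsumes the class
            else:
                ex.append((attr, cls))    # attribute and class are exclusive
    # All object classes are assumed mutually exclusive.
    class_ex = [(a, b) for i, a in enumerate(classes) for b in classes[i + 1:]]
    return sub, ex, class_ex

# Hypothetical two-class, two-attribute predicate matrix.
predicates = {
    "zebra": {"stripes": 1, "furry": 0},
    "bear": {"stripes": 0, "furry": 1},
}
sub_edges, ex_edges, class_ex = build_attribute_relations(predicates)
```

In the pHEX setting each of these edges would additionally carry a strength u, either a shared constant or a value derived from the continuous predicates.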

7.1. Experimental setup 

Table 1. The Ising coefficients u and the corresponding strengths of the
label relations q used in pHEX graphs in the experiments, where
$q = \exp(-4u)$.

u    0    0.1    0.3    0.5    0.7    1.0    1.5
q    1    0.67   0.30   0.14   0.06   0.02   0.002
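The correspondence between u and q in Table 1 follows directly from q = exp(-4u), and the inverse mapping is u = -ln(q)/4. A short sketch for reproducing the table entries (the variable names are illustrative):

```python
import math

us = [0, 0.1, 0.3, 0.5, 0.7, 1.0, 1.5]
qs = [math.exp(-4 * u) for u in us]           # relation strength per coefficient

def u_from_q(q):
    """Inverse mapping; only defined for q > 0 (q = 0 is the hard limit)."""
    return -math.log(q) / 4.0

for u, q in zip(us, qs):
    print(f"u = {u:<4} q = {q:.3f}")
```

The hard HEX constraint corresponds to q = 0, which no finite u reaches; u = 1.5 already gives q of roughly 0.002.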


In our experiments, we used two types of pHEX graphs. For the ImageNet
experiments, we use the same constant edge strength for all edges; we vary
this edge parameter u across the range shown in Table 1, and plot results for
each value. For the zero-shot experiments, we consider constant edge weights,
but we also consider variable edge weights, which we derive by scaling the
prior edge weight (derived from the data) by a global scale factor u, which
we again vary across a range.

Note that, since all three tasks are evaluated on test labels in a
multi-class setting, we add a MECE relation into the pHEX graphs. In
particular, for the ImageNet dataset, we add a multinomial node on the 1000
leaf labels; in the Animals with Attributes dataset, we add a multinomial
node on the 50 animal classes; and in the VOC/Yahoo dataset, we add a
multinomial node on the 32 object classes. After adding MECE relations, we
remove the replicated soft exclusive relations from the pHEX graph.

7.2. ImageNet classification experiments 

Figure 4. Top-1 (top) and Top-5 accuracies (bottom) vs relation strength u
for the ImageNet classification experiment. The results of the pHEX graphs
are in the red solid curves, and the results of the HEX graphs are in the
blue dashed horizontal lines. From left to right: relabeling 50%, 90%, 95%,
99%.

In this section, we use the ILSVRC2012 dataset [7], which consists of 1.2M
training images from 1000 object classes. These 1000 classes are mutually
exclusive leaf nodes of a semantic hierarchy based on WordNet that has 860
internal nodes. As in [8], we evaluate the recognition performance in the
multiclass classification at the leaf level, but allow the training examples
to be labeled at different semantic levels. Since ILSVRC2012 has no training
examples at internal nodes, we create training examples for internal nodes by
relabelling {50%, 90%, 95%, 99%} of the leaf examples to their immediate
parents based on the WordNet hierarchy. Since the ground truth for the test
set is not released for ILSVRC2012, we use 10% of the released validation set
as our validation set and the other 90% as our test set.
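The relabeling step can be sketched as follows. The text does not say whether the fraction is applied per class or over the whole training set, nor how examples are selected; this sketch assumes uniform random selection over the whole set, and the function name, seed and labels are hypothetical.

```python
import random

def relabel_to_parents(labels, parent, fraction, seed=0):
    """Replace `fraction` of the leaf labels by their immediate parent label."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)                       # assumed: uniform random selection
    k = round(fraction * len(labels))
    out = list(labels)
    for i in idx[:k]:
        out[i] = parent[labels[i]]
    return out

# Hypothetical toy data: 100 "husky" examples, 90% relabeled to "dog".
labels = ["husky"] * 100
relabeled = relabel_to_parents(labels, {"husky": "dog"}, fraction=0.9)
```

With 99% relabeling only about one example in a hundred keeps its leaf label, which is the regime in which the text reports the largest gap between pHEX and HEX.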

The underlying feed-forward network that we use is based on the deep
convolutional neural network GoogLeNet [22]. Since GoogLeNet is such a large
model, we adopt the following staged training procedure. First we pre-train a
CNN with a HEX graph as the top layer until convergence. Then we fine-tune
the entire model with pHEX graph layers of different coefficients u on top.
This can be thought of as a form of curriculum learning [2] by training with
a simpler model (HEX graph) with exact inference first.

Figure 4 shows the Top-1 (top row) and Top-5 (bottom row) accuracies across
classes as a function of u, for the relabeling experiments. For comparison,
the Top-1 and Top-5 accuracies without relabeling (i.e., the standard
ImageNet setup) are 70.1% and 90.0% respectively. Not surprisingly,
relabeling (i.e., only providing some labels at the leaves, and using coarser
grained categories for the rest) hurts performance (as estimated by
leaf-level accuracy). However, in this regime (which occurs commonly in
practice), pHEX generally outperforms HEX, especially for 90%, 95% and 99%
relabeling, where the accuracies improve by 2%, 3% and 8% respectively. (Note
that a 1% difference in performance is considered statistically significant
on this problem due to the large size of this dataset.)

At first, it might seem odd that relaxing the hard constraints imposed by the
hierarchy can help, since the hierarchy provided by WordNet is supposed to be
correct. However, [8] observed that too few training examples labeled at leaf
nodes (especially at 99% relabeling) may confuse the leaf models, especially
at the beginning of the training. As the algorithm runs longer, it becomes
harder to recover from a bad local minimum because the constraints in the HEX
graph are hard constraints. By contrast, in the pHEX graph, the weaker
relations between internal nodes and leaf nodes make the resulting posterior
distribution smoother, so it is easier to overcome bad local minima for the
pHEX graph in later iterations.

It is also interesting to see that the optimal value of u appears to depend
on the relabeling percentage. When a larger portion of training examples are
relabeled, e.g. 99% relabeling, the optimal relation coefficient becomes
weaker (u = 0.1). This indicates that weaker label relations are preferred
when there is more uncertainty in the leaf labels.

On the other hand, when u is large, the label relations become quite certain
and the pHEX graph becomes closer to the HEX graph. In the case of u = 1.5
($q \approx 0.002$), the performance of the pHEX graph can become worse than
that of the HEX graph, probably due to the inability to perform exact
inference in the pHEX graph.

7.3. Zero shot learning experiments 

We use two datasets to illustrate zero shot learning. The first is the
Animals with Attributes dataset [13], which includes images from 50 animal
classes. For each animal class, it provides both binary and continuous
predicates for 85 attributes. We convert the binary predicates to constant
(soft) relations, and the continuous predicates to variable soft relations by
a monotonic mapping function. The details of the mapping are provided in the
supplementary material. We evaluate the zero-shot setting where training is
performed using only examples from 40 animal classes (with 24295 images) and
testing is on classifying the 10 unseen classes (with 6180 images). Our
experimental results are based on 5-fold cross validation. The underlying
network is a single-layer network whose inputs come from the recently
released DECAF features [ ].

Figure 5. Mean accuracy per class vs relation strength u for the zero-shot
learning experiments. Left: Animals with Attributes. The results of the pHEX
with variable edge weights are shown as dotted green curves, and the ones
with constant edge weights as solid red curves. The results of the HEX graphs
are shown as blue dashed horizontal lines. Right: VOC/Yahoo images with
attributes.

The second dataset is the aPascal-aYahoo dataset [ .0], which consists of a
12695 image subset of the PASCAL VOC 2008 dataset and 2644 images that were
collected using the Yahoo image search engine. The PASCAL part serves as
training data and has 20 object classes. The Yahoo part serves as test data
and contains 12 different object classes. Each image has been annotated with
64 binary attributes that characterize shape, material and the presence of
important parts of the visible object. We convert them to binary and
continuous predicates for attributes per object by averaging the image
annotations for every object (details in supplementary material). The
underlying network is again a single-layer network whose inputs come from the
features that the authors of [10] extracted from the objects' bounding boxes
(as provided by the PASCAL VOC annotation) and released as part of the
dataset. Once again we use 5-fold cross validation and compare constant soft
relations and variable soft relations with hard relations.

Figure 5 shows the mean accuracy per class (along with standard errors) vs u.
We see that pHEX generally outperforms HEX by a significant margin. In
particular, when $u \in [0.1, 1.5]$ for Animals with Attributes and
$u \in [0.3, 1.0]$ for VOC/Yahoo, the difference is statistically significant
at the 5% level according to a paired t-test. The accuracies of the pHEX
graph get closer to those of the HEX graph as u becomes larger and the pHEX
graph approaches the HEX graph. Moreover, the pHEX models with variable soft
relations improve over the ones with constant soft relations by 2% for
Animals with Attributes and 1% for VOC/Yahoo. This demonstrates the value of
adding additional information in the variable probabilistic label relations
in transfer learning.

7.4. Speed comparison of HEX vs pHEX 

In the ImageNet experiments, the cost of HEX and pHEX is similar, since most
of the time is spent evaluating the underlying deep CNN. In the two zero-shot
learning experiments, the inference time of the pHEX graph is about the same
as that of the HEX graph. Furthermore, many other algorithms such as quantum
annealing [ 1] (which are faster and/or more accurate than loopy belief
propagation) have been devised for Ising models, which we could try in the
future.

8. Conclusions 

In this paper, we studied object classification with probabilistic label
relations. In particular, we proposed the pHEX graph, which naturally
generalizes the HEX graph. The pHEX graph is equivalent to an undirected
Ising model, which allows for efficient approximate inference methods. We
embed the pHEX graph on top of a deep neural network, and show that it
outperforms the HEX graph on a number of classification tasks which require
exploiting label relations.

There are several possible future directions of this work. One idea is to
learn the Ising coefficients of the pHEX graph together with the underlying
neural network parameters. Another is to combine the pHEX graph into a larger
framework which exploits spatial relations between objects.